Efficient spatial and channel net for lane marker detection based on self-attention and row anchor

Lane detection is an important component of advanced driver-assistance systems (ADAS) and feeds directly into the planning and control algorithms, so it must meet high standards of accuracy and speed. Self-attention-based lane detection has recently attracted growing interest, because extracting global information is effective in difficult situations such as shadows, bright light, and night scenes. However, neither channel attention nor spatial attention alone can extract all global information unless a complicated model is used, which in turn increases the run time, and balancing these conflicting requirements is challenging. In this study, a new lane detection model that combines channel and spatial self-attention was developed. Conv1d and Conv2d were introduced to extract the global information. The model is lightweight and efficient, avoiding complex model calculations and massive matrices, and it copes well with difficult conditions. We used the TuSimple and CULane datasets as verification standards. On the TuSimple benchmark, the accuracy was the highest at 95.49%. On the CULane dataset, the proposed model achieved an F1 of 75.32%, which is the highest result, particularly in difficult scenarios. For the TuSimple and CULane datasets, the proposed model achieved the best performance in terms of accuracy and speed.

Autonomous driving is a complex process that involves a variety of sensors such as cameras, lidar and radar and requires increasingly complex models and algorithms. The aim is to fully understand the environment in order to take appropriate action. Lane markings are an extremely important input to vehicle control. Functions such as lane keeping and highway assistance are highly dependent on them, and they are also essential for regional planning and vehicle control. Therefore, increasing importance is being given to improving the response time and accuracy of vision-based lane marking detection [1][2][3][4][5][6].
The fundamental problem in lane marking identification, as previously noted [7][8][9], is how to accurately detect the lane line under difficult circumstances. A lack of visual cues, caused by heavy vehicle occlusion, harsh lighting, shadows and wet conditions, leads to missed detections or false alarms. Traditional vision-based methods [10] are mainly based on hand-crafted features, gray images, ROIs and various operators such as SIFT [11] and SURF [12]. However, their ability to adapt to difficult weather and harsh lighting conditions is inadequate, which prevents wide generalization and use. CNNs have attracted a lot of attention in recent years and work well in extracting features. However, to achieve high performance in classification and regression, it is necessary to make a trade-off between the receptive field and the network depth. A2-Net [13], Squeeze-and-Excitation Networks [14], CBAM [15] and Gather-Excite [16] are examples of attention and self-attention mechanisms [17,18] that have been developed and that support the detection of lane markings. They apply attention across multiple spatial areas or across channels to extract global information comprehensively.
Spatial self-attention focuses on spatial relationships rather than channel correlations, whereas channel self-attention emphasizes channel rather than spatial dependence. We thoroughly examine the benefits of channel attention and spatial attention to understand the different facets of self-attention, and we propose ESCN, an efficient spatial and channel network. The main contributions of our proposed model are summarized as follows:
1. A brand new ESCN mechanism. To build a novel ESCN model, we merged spatial and channel self-attention based on the anchor representations. It can fully utilize channel and spatial correlations simultaneously to extract global information, especially under difficult conditions.
1. Traditional vision-based approaches. The primary technologies at this level are vision-based methods [19][20][21][22][23][24][25][26]. They fall into three categories: model-based, feature-based, and region-based approaches. Model-based techniques typically involve four processes: image segmentation, vanishing point selection, orientation estimation and lane detection. During the image segmentation step, the entire image is divided into a near field and a far field as separate ROIs. Ref. [27] uses the Gaussian model [28] and maximum likelihood to estimate the vanishing point and applies a Gabor filter to estimate the orientation. The Canny edge detector, Hough transform, Catmull-Rom spline, spline model, cubic spline, IPM and particle filter are also used in lane boundary detection. Feature-based techniques comprise three steps: feature extraction, line detection and tracking. In [29], a local thresholding technique was proposed to extract features from an ROI, with template matching used for line detection. The EKF was proposed in [30][31][32][33] for lane marking tracking. The region-based approach includes both region finding and feature tracking. The Shi-Tomasi method was adopted in [33] for feature extraction, while the Lucas-Kanade tracker and the optical flow algorithm were adopted in [34] and [35], respectively.
2. Segmentation approach using CNN. Several groups are currently working intensively on applying CNN techniques [36,37] to lane marker detection [38,39]. Like R-CNN, CNNs are data-driven and well suited to feature extraction [40]. Although deep learning-based methods such as CNN [41] include numerous convolution and pooling layers, they cannot fully utilize the information and context, especially in difficult situations such as occlusion, lane marking degradation, and changing road conditions. Semantic segmentation [42] and instance segmentation [43] have been proposed as solutions to this problem. Ref. [44] proposed pixel-level semantic segmentation to identify lane markings. Ref. [45] proposes a UNet-based weakly supervised lane marking detection network. In contrast to semantic segmentation, [46] presents an end-to-end lane marking detection method based on instance segmentation, which consists of a lane segment branch and a lane embedding branch to increase the speed of lane marking detection. Ref. [47] proposed a fast structured lane detection network that selects regions with given lines instead of the entire image to avoid extensive processing.
3. CNN with attention. Detecting lane markings in difficult situations is a significant problem. CNN-based segmentation techniques were successful but encountered significant challenges. For example, semantic segmentation-based methods require significant computational effort, and both the accuracy of the detected lane lines and the number of lane lines detected need to be improved. Therefore, attention-based methods for lane marking detection have received growing interest [48][49][50][51][52][53]. In [52], an ESA module based on an encoder-decoder architecture was proposed, and HESA and VESA were integrated to determine the position of the occlusion more precisely. Ref. [50] proposed to use spatial attention to collect boundary information across multiple locations and channel attention in the GCE module to extract information about the global context. Ref. [51] proposed residual blocks and attention mechanisms for U-Net.
Comparisons between the above methods can be found in Table 1.

Proposed approach
System overview
A2-Net [13], Squeeze-and-Excitation Networks [14], CBAM [15] and Gather-Excite [16] are examples of attention and self-attention mechanisms [17,18] that have been developed and that support the detection of lane markings. They apply attention across multiple spatial areas or across channels to extract global information thoroughly.
Spatial self-attention focuses on spatial relationships rather than channel connections, whereas channel self-attention emphasizes channels rather than spatial dependence. We thoroughly examine the benefits of channel and spatial attention to understand the different facets of self-attention, and we propose ESCN, an efficient spatial and channel network, as shown in Fig. 1. As mentioned above, lane marking detection is difficult in challenging scenarios such as severe lane erosion, strong shadow and vehicle occlusion. To overcome these problems, we introduce a lightweight and efficient channel attention model that takes the feature maps extracted by the DCNN as inputs. To obtain global semantic and contextual information, each anchor vector is matched across its own channel and its neighboring channels. This captures cross-channel interactions and learns channel attention efficiently while avoiding dimensionality reduction, as shown in Fig. 1. It can therefore summarize and abstract this global information without changing the receptive field, and it improves classification precision and localization accuracy at the same time.

Backbone
We used ResNet34 as the backbone of our proposed ESCN. It has four types of residual blocks. Their convolution kernels are 3 × 3, and their numbers of kernels (output channels) are 64, 128, 256, and 512, respectively. The residual connections successfully prevent the gradient from vanishing or exploding.
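As a rough illustration of the backbone described above, the sketch below traces the feature-map size through a standard ResNet-34 layout for a 360 × 640 input. The stem (a 7 × 7 convolution and a 3 × 3 max pooling, both with stride 2) and the stage strides follow the usual ResNet-34 design and are assumptions here, since the text only states the kernel size and the channel numbers. The function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet34_feature_shapes(height, width):
    """Return (channels, height, width) after each of the four residual stages.

    Assumes the standard ResNet-34 stem and stage strides; the text itself
    only specifies 3x3 kernels and 64/128/256/512 channels.
    """
    # Stem: 7x7 conv (stride 2, padding 3), then 3x3 max pool (stride 2, padding 1).
    h, w = conv_out(height, 7, 2, 3), conv_out(width, 7, 2, 3)
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)

    shapes = []
    # (output channels, stride of the first 3x3 convolution in the stage)
    for channels, stride in [(64, 1), (128, 2), (256, 2), (512, 2)]:
        h, w = conv_out(h, 3, stride, 1), conv_out(w, 3, stride, 1)
        shapes.append((channels, h, w))
    return shapes


if __name__ == "__main__":
    for shape in resnet34_feature_shapes(360, 640):
        print(shape)
```

Under these assumptions the last stage yields a 512-channel map of size 12 × 20, which would be the C × H × W input seen by the attention blocks if the full backbone is used.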

Efficient channel and spatial attention block
The heart of the ESCN is an efficient channel and spatial attention block. It consists of two attention modules: an efficient channel attention block and an efficient spatial attention block. Combining them places global contextual information alongside global location information in a single channel. Therefore, lane marking features can be effectively extracted even in difficult situations.
1. Efficient Channel Attention Block. After the feature maps are extracted by ResNet34, the local feature map $\mathrm{anchor}_{local} \in \mathbb{R}^{C \times H \times W}$ serves as the input, where C, H and W denote the channel number, feature-map height and feature-map width, respectively. Global average pooling is then applied to it as shown in (1):
$$y_k = f(X_k) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_k(i, j) \qquad (1)$$
where $k = 1, 2, 3, \ldots, C$ and $X_k$ is the k-th channel of $\mathrm{anchor}_{local}$. So we get $Y = [y_1, y_2, y_3, \ldots, y_i, \ldots, y_j, \ldots, y_C]^T$.
A linear transformation of Y involves many parameters. Regardless of whether it uses a full or a diagonal matrix, this results in numerous computations. To avoid this, we propose a 1D convolution with kernel size k followed by a ReLU, where ReLU is the Rectified Linear Unit and $C1D_k$ is the 1D convolution, which involves only k parameters. This reduces the number of parameters and the computation time. Here k represents the extent of the local cross-channel interaction and is the key factor in reducing the parameter quantity. To avoid manual tuning, we adopted the adaptive expression of [54]:
$$k = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{odd}$$
where k is an odd number and C is the channel number. In our proposed model, we set γ and b to 2 and 1, respectively. The detailed architecture is shown in Fig. 3. Finally, we obtain the output of the efficient channel attention block, denoted $\mathrm{anchor}_{channel\_attention}$. The structure is shown in Fig. 2.
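The channel attention steps above (global average pooling, a 1D convolution across channels, and a ReLU gate) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the rounding rule in `adaptive_kernel_size` is one common way to pick the nearest odd kernel size, the gate is assumed to rescale each input channel, and all function names are hypothetical. In practice the convolution would be a framework layer such as `torch.nn.Conv1d`.

```python
import math


def adaptive_kernel_size(channels, gamma=2, b=1):
    """Odd 1D-convolution kernel size derived from the channel count (assumed rounding rule)."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1


def global_average_pool(feature_map):
    """Average each channel (an H x W nested list) down to a single number."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def conv1d_same(values, kernel):
    """1D convolution across channels with zero padding, keeping the length unchanged."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(values) + [0.0] * pad
    return [
        sum(weight * padded[i + j] for j, weight in enumerate(kernel))
        for i in range(len(values))
    ]


def channel_attention(feature_map, kernel):
    """Rescale every channel by a gate computed from its neighbouring channels.

    Multiplying the gate back onto the input is an assumption of this sketch.
    """
    pooled = global_average_pool(feature_map)
    gates = [max(0.0, g) for g in conv1d_same(pooled, kernel)]  # ReLU, as in the text
    return [
        [[gate * value for value in row] for row in channel]
        for gate, channel in zip(gates, feature_map)
    ]


if __name__ == "__main__":
    print([adaptive_kernel_size(c) for c in (64, 128, 256, 512)])
    toy = [[[1.0, 3.0]], [[2.0, 2.0]], [[4.0, 0.0]]]  # 3 channels, 1 x 2 each
    print(channel_attention(toy, kernel=[0.5, 0.0, 0.5]))
```

The kernel holds only k weights regardless of the channel count, which is the source of the parameter saving discussed in the text.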
2. Efficient Spatial Attention Block. The local feature map $\mathrm{anchor}_{local}$ is fed into both the channel attention and spatial attention blocks. In the efficient spatial attention block, MaxPool and AvgPool operations are applied to $\mathrm{anchor}_{local}$ across the channel dimension. The MaxPool output $\mathrm{anchor}^{i,j}_{spatial,max\_pool}$ is the maximum over the C channels at location (i, j), which gives the feature map $\mathrm{anchor}_{spatial,max\_pool}$. Likewise, the AvgPool output $\mathrm{anchor}^{i,j}_{spatial,avg\_pool}$ is the mean over the C channels at location (i, j), which gives the feature map $\mathrm{anchor}_{spatial,avg\_pool}$.
After we obtain $\mathrm{anchor}_{spatial,max\_pool}$ and $\mathrm{anchor}_{spatial,avg\_pool}$, we concatenate them and apply a Conv2d with kernel size 3 followed by a sigmoid function. Finally, we obtain the output of the efficient spatial attention block, $\mathrm{anchor}_{spatial\_attention}$. The detailed architecture is illustrated in Fig. 3.
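The spatial attention steps above can be sketched in the same dependency-free style. The sketch assumes that both pooling operations run across the channel dimension, that the 3 × 3 convolution uses zero padding of 1, and that there is no bias term. The function names and the toy kernels are hypothetical, and only the attention map itself is computed.

```python
import math


def channel_pool(feature_map):
    """Per-pixel maximum and mean across the channels of a C x H x W nested list."""
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    max_map = [[max(feature_map[c][i][j] for c in range(channels))
                for j in range(width)] for i in range(height)]
    avg_map = [[sum(feature_map[c][i][j] for c in range(channels)) / channels
                for j in range(width)] for i in range(height)]
    return max_map, avg_map


def conv2d_same(planes, kernels):
    """3x3 convolution with zero padding 1: one kernel per input plane, one output plane."""
    height, width = len(planes[0]), len(planes[0][0])
    out = [[0.0] * width for _ in range(height)]
    for plane, kernel in zip(planes, kernels):
        for i in range(height):
            for j in range(width):
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        y, x = i + di, j + dj
                        if 0 <= y < height and 0 <= x < width:
                            out[i][j] += kernel[di + 1][dj + 1] * plane[y][x]
    return out


def spatial_attention_map(feature_map, kernels):
    """Sigmoid of a 3x3 convolution over the concatenated max-pooled and avg-pooled maps."""
    max_map, avg_map = channel_pool(feature_map)
    response = conv2d_same([max_map, avg_map], kernels)  # two-plane (concatenated) input
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in response]


if __name__ == "__main__":
    toy = [[[1.0, 2.0], [3.0, 4.0]], [[3.0, 0.0], [1.0, 4.0]]]  # C=2, H=2, W=2
    centre = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    zeros = [[0.0] * 3 for _ in range(3)]
    print(spatial_attention_map(toy, [centre, zeros]))
```

The convolution has a fixed 2 × 3 × 3 weight tensor, so its cost does not grow with the number of channels.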

Classification network and regression network
$\mathrm{anchor}_{global}$ is input into the classification network and the regression network separately. Each network consists of a linear layer followed by a reshaping operation. Their outputs are then joined into a tensor $\mathrm{proposals} \in \mathbb{R}^{batch \times anchors \times (K + n\_offset)}$, where K is the number of classes and n_offset is the number of offsets along the X coordinate. Finally, non-maximum suppression (NMS) is applied to the proposals iteratively along the batch dimension. A softmax is applied to the K classification columns of the scores ($\mathrm{scores} \in \mathbb{R}^{anchors \times (K + n\_offset)}$); in our study K is set to 2. We then select the anchor rows whose probabilities exceed conf_threshold, the confidence threshold for deciding whether an anchor is a lane marking. In this way we obtain the classification results. The index of each selected anchor also identifies its regression result, because the classification and regression components of an anchor lie in the same row, in different columns. The detailed architecture of the classification and regression networks is shown in Fig. 4. The loss function, Eq. (8), is a weighted combination of a classification loss and a regression loss over the anchors, where $c_i$ and $a_i$ are the classification and regression predictions, respectively, $c^*_i$ and $a^*_i$ are the ground truths for anchor i, and $N_a$ is the total number of anchors. $k_c$ and $k_r$ are the coefficients of the classification and regression losses, respectively, and are used to balance the loss value. In the proposed model, $k_c = 10$ and $k_r = 1$. We use Focal Loss [54] as the classification loss and Smooth L1 as the regression loss.
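One plausible reading of the weighted loss described above is sketched below. Several details are assumptions rather than statements from the text: the focal-loss parameters `alpha` and `gamma` are common defaults, the regression term is computed only for positive anchors, and the sum is normalized by the number of anchors. All function names are hypothetical.

```python
import math


def focal_loss(prob_lane, is_lane, alpha=0.25, gamma=2.0):
    """Binary focal loss for one anchor; prob_lane is the predicted lane probability."""
    p_t = prob_lane if is_lane else 1.0 - prob_lane
    p_t = max(p_t, 1e-12)  # guard against log(0)
    alpha_t = alpha if is_lane else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss averaged over the offsets of one anchor."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)


def total_loss(cls_probs, cls_targets, reg_preds, reg_targets, k_c=10.0, k_r=1.0):
    """Weighted sum of classification and regression losses, averaged over anchors."""
    n_anchors = len(cls_probs)
    cls_term = sum(focal_loss(p, t) for p, t in zip(cls_probs, cls_targets))
    # Assumption: only anchors labelled as lanes contribute to the regression term.
    reg_term = sum(
        smooth_l1(pred, target)
        for pred, target, is_lane in zip(reg_preds, reg_targets, cls_targets)
        if is_lane
    )
    return (k_c * cls_term + k_r * reg_term) / n_anchors


if __name__ == "__main__":
    print(total_loss([0.9, 0.2], [True, False], [[1.0, 2.5], [0.0, 0.0]], [[1.2, 2.0], [0.0, 0.0]]))
```

With k_c = 10 and k_r = 1 the classification term is weighted ten times more heavily than the regression term, matching the coefficients given in the text.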

Experiments
Dataset
To demonstrate the effectiveness of the model and evaluate the results of our proposed methodology, we used two commonly used benchmark datasets, TuSimple [55] and CULane [1]. The TuSimple dataset consists mostly of highway scenes. Owing to the uniform illumination, its lane markings are much easier to detect, whereas the CULane dataset is far more complicated. It covers nine scenarios in city and highway environments: normal, crowded, dazzle light, night, no line, shadow, arrow, curve and crossroad. Table 2 provides a detailed explanation of the two datasets.

Evaluation metrics
The TuSimple and CULane benchmarks use different evaluation metrics. Accuracy served as the evaluation standard for the TuSimple benchmark.
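A simplified sketch of a point-wise accuracy of the kind used on TuSimple is given below. It is not the official evaluation script: it assumes that predicted and ground-truth lanes are already matched one to one, that lanes are sampled at the same fixed rows, that a negative x marks an absent point, and that a fixed 20-pixel threshold decides whether a point is correct. The function name is hypothetical.

```python
def tusimple_accuracy(pred_lanes, gt_lanes, pixel_threshold=20.0):
    """Fraction of ground-truth lane points matched by the prediction.

    Each lane is a list of x coordinates sampled at the same rows;
    a negative x means the lane is absent at that row.
    """
    correct = 0
    total = 0
    for pred, gt in zip(pred_lanes, gt_lanes):
        for x_pred, x_gt in zip(pred, gt):
            if x_gt < 0:
                continue  # no ground-truth point at this row
            total += 1
            if x_pred >= 0 and abs(x_pred - x_gt) <= pixel_threshold:
                correct += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    prediction = [[100, 110, -2, 130]]
    ground_truth = [[105, 140, -2, 125]]
    print(tusimple_accuracy(prediction, ground_truth))
```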
For the CULane benchmark, the final evaluation metric was the F1 score, which combines two other metrics: Precision and Recall.
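The relationship between Precision, Recall and F1 can be written out as a short sketch. How a predicted lane is counted as a true positive is not stated in the text; on CULane the usual convention is an intersection-over-union test between predicted and ground-truth lanes, which is treated as an assumption here and left outside the function. The function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    print(precision_recall_f1(tp=3, fp=1, fn=1))
```

Because F1 is the harmonic mean of the two rates, a model cannot score well by favouring only Precision or only Recall.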

Implementation details
In the experiment, all images were resized to 360 × 640 pixels for both TuSimple and CULane, so H and W were set to 360 and 640, respectively. The number of epochs was set to 50 for the TuSimple benchmark and 15 for CULane. The batch size was set to eight and the learning rate was set to 0.0003 using the Adam optimizer. We used Python 3.7, PyTorch 1.6.0, CUDA 10.1 and cuDNN 7.2 as the experimental environment.
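For reference, the hyperparameters listed above can be collected in a small configuration object. The class and field names are hypothetical and chosen only for illustration; the values are those stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training settings as reported in the text (field names are illustrative)."""
    dataset: str
    epochs: int
    img_h: int = 360
    img_w: int = 640
    batch_size: int = 8
    learning_rate: float = 3e-4
    optimizer: str = "Adam"


TUSIMPLE = TrainConfig(dataset="TuSimple", epochs=50)
CULANE = TrainConfig(dataset="CULane", epochs=15)

if __name__ == "__main__":
    print(TUSIMPLE)
    print(CULANE)
```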
The accuracy of the proposed model is 95.49% on the TuSimple benchmark, compared with 95.12% for the best competing technique, an improvement of 0.37 percentage points. The proposed model had FP and FN values of 0.0307 and 0.0342, respectively. The former is the lowest among all methods, with a gap of only 0.0118. Thus, the proposed model is the most effective among all methods and achieves state-of-the-art performance on the TuSimple dataset, as shown in Table 3.

Results on CULane dataset.
On the CULane benchmark, the proposed model also outperforms all other methods. Across the nine challenging scenarios it achieves 75.67%, the best overall result. This is an increase of 13.48% in the total scenario over ResNet-18 [4] and of 4.17% over R-34-E2E, which is the highest among the other methods. In the crossroad scenario, where the reported values are FP counts, the value of our model is also the lowest. For embedded systems, real-time performance is a top priority. Although the inference speed of our model is slower than that of ResNet-18, its accuracy is significantly higher. According to the FPS comparison in Table 4, the speed of our model is better than that of most models: it is 82.84 FPS faster than the slowest model and only 33.12 FPS slower than the fastest model, while its accuracy is 13.14% higher than that of the fastest model. Fig. 5 shows that the lane marking detection results of our proposed model are better than those of the other methods in the visualization. For example, the lane lines predicted by the other models mostly show jitter and position deviation, and they fail to remain parallel. Therefore, our proposed model also achieves state-of-the-art performance on the CULane benchmark, as shown in Table 4.

Ablation study
Conv1d and Conv2d were used instead of the full matrix in our proposed model. Our goal was to reduce the number of calculations and parameters without significantly reducing the accuracy of the model. Judging from the comparison results in Table 5, when the channel-related self-attention mechanism is optimized, the F1 index is reduced by only 0.35, which is about 0.46%. Table 5 shows that our ESCN model performs better than the full-matrix model in the TuSimple test: accuracy improved by 0.19%, and FP and FN also improved, changing by 0.0015 and 0.0026, respectively. On the CULane benchmark, although F1 decreased by 0.45%, as shown in Tables 5 and 6, convolution was used instead of matrix calculation, which significantly reduced the amount of computation.
From Fig. 6, we can see that the loss changes rapidly at the beginning of the training phase for both the TuSimple and CULane datasets. The loss gradually stabilizes for the TuSimple dataset, whereas it still fluctuates considerably for the CULane dataset. The learning rate follows essentially the same pattern for both datasets: it gradually becomes smaller and then gradually increases.
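To make the parameter saving concrete, the sketch below counts the learnable weights needed to mix C pooled channel descriptors with a full matrix, a diagonal matrix, and a 1D convolution of kernel size k. Bias terms are ignored, and the function name is hypothetical.

```python
def channel_mixing_parameters(channels, kernel_size):
    """Weight counts for three ways of transforming C pooled channel descriptors."""
    return {
        "full_matrix": channels * channels,  # C x C linear transformation
        "diagonal_matrix": channels,         # one weight per channel
        "conv1d": kernel_size,               # k shared weights across all channels
    }


if __name__ == "__main__":
    print(channel_mixing_parameters(channels=512, kernel_size=5))
```

For a 512-channel feature map, a full matrix needs 262,144 weights, whereas a 1D convolution with k = 5 needs only five.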

Conclusion
In this study, we propose an efficient spatial and channel network aimed at detecting lane marking lines, especially in difficult environments. We used channel self-attention, which deals with global and contextual information in a single channel, and spatial self-attention, which focuses more on location across many channels, to identify aspects that are likely to be missed in difficult situations. We evaluated our proposed model on the TuSimple and CULane benchmarks, where it achieved advantages of 3.82% and 0.37% over the other methods and delivered excellent performance. However, our proposed model has three limitations. First, we simplify the channel-wise attention model through a global average pooling operation. Although this improves inference speed, it reduces accuracy compared with the fully connected matrix. Second, we take the same measures for the spatial attention. As a result, global information is easily ignored and the relationship between different pixels is missing. In addition, the kernel size limits the receptive field. Finally, our proposed model pays significantly more attention to the information within a single image while neglecting the connection between consecutive images. The self-attention mechanism retrieves global information across channels and locations, but not semantic and contextual information between frames. Accuracy in lane marking detection is crucial for autonomous driving, and run-time performance on embedded platforms is a crucial part of future research. In future work, we will consider a more comprehensive approach to constructing our model. We will combine techniques such as RNN, LSTM, semantic segmentation and instance segmentation with attention or self-attention to merge their advantages and obtain an improved model. If we do not limit ourselves to compatibility with embedded systems, we will also consider large models.

Figure 4. The architectures of the classification network and the regression network.

Figure 5. Visual comparison between our proposed model and other methods on the CULane benchmark.

Figure 6. Loss and learning-rate curves on the TuSimple and CULane datasets.

Table 1. Comparisons among different methods of lane marking detection.

Table 2. Overview of the datasets used in this paper.

Table 3. Comparison between our model and other methods on the TuSimple dataset.

Table 4. Accuracy comparison between our model and other methods on the CULane dataset. Significant values are in bold.

Table 5. Ablation comparison on the TuSimple benchmark dataset. Significant values are in bold.

Table 6. Ablation comparison on the CULane benchmark dataset. Significant values are in bold.